Implement SVE2 non-temporal gather loads #123890

ylpoonlg · 2026-02-02T15:42:15Z

Contributes to #94019 .

@dotnet/arm64-contrib @a74nh

dotnet-policy-service · 2026-02-02T15:43:26Z

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

stephentoub · 2026-02-11T19:19:53Z

🤖 Copilot Code Review — PR #123890

Holistic Assessment

Motivation: This PR adds SVE2 non-temporal gather load intrinsics (LDNT1B, LDNT1SH, LDNT1SW, LDNT1H, LDNT1W, LDNT1D), contributing to issue #94019. Non-temporal loads are valuable for streaming data access patterns where cache pollution should be avoided. The need for these APIs is justified by the approved API proposal.

Approach: The implementation follows the established pattern for SVE gather loads, adding 12 new intrinsic entries in the JIT tables, corresponding managed APIs in Sve2.cs, and codegen support. The approach of manually emitting LSL instructions to compute byte offsets from element indices is correct since SVE2 LDNT1 instructions (unlike standard LD1 gathers) don't support embedded shift immediates.

Summary: ⚠️ Needs Changes. The implementation has one correctness issue in the codegen and lacks active test coverage. These should be addressed before merge.

Detailed Findings

❌ Codegen correctness — `GatherVectorNonTemporal` may fail for 4-byte elements

File: src/coreclr/jit/hwintrinsiccodegenarm64.cpp (lines 2362-2366)

The codegen for NI_Sve2_GatherVectorNonTemporal has this logic:

else if (intrin.id == NI_Sve2_GatherVectorNonTemporal)
{
    assert(emitActualTypeSize(intrin.baseType) == 8);
    GetEmitter()->emitIns_R_R_I(INS_sve_lsl, emitSize, op3Reg, op3Reg, 3, opt);
}

This asserts that the base type is 8 bytes and always shifts by 3. However, looking at the intrinsic table in hwintrinsiclistarm64sve.h, GatherVectorNonTemporal maps to:

INS_sve_ldnt1w for int/uint/float (4-byte elements)
INS_sve_ldnt1d for long/ulong/double (8-byte elements)

Verification: The managed APIs in Sve2.cs show that the 4-byte Vector-based overloads (e.g., GatherVectorNonTemporal(Vector<int>, Vector<uint>)) are intentionally commented out per #103297. However, the GatherVectorWithByteOffsetsNonTemporal APIs for 4-byte types (Vector<int>, Vector<float>) are exposed. Since the intrinsic table entry for GatherVectorWithByteOffsetsNonTemporal uses the same instruction (ldnt1w/ldnt1d), this case must be handled correctly.

Looking more carefully at the codegen case list at lines 2334-2345, both NI_Sve2_GatherVectorNonTemporal and NI_Sve2_GatherVectorWithByteOffsetsNonTemporal share the same case block. For the WithByteOffsets variant, no shift is needed (the offsets are already in bytes). However, GatherVectorNonTemporal with indices does need shifting.

Issue: If GatherVectorNonTemporal is called with 4-byte types (int/float) through overloads like GatherVectorNonTemporal(Vector<int>, int*, Vector<uint>) — which is exposed in the managed API — the current codegen will hit the assert and fail, or compute incorrect addresses.

Recommendation: Add handling for the 4-byte case:

else if (intrin.id == NI_Sve2_GatherVectorNonTemporal)
{
    emitAttr baseSize = emitActualTypeSize(intrin.baseType);
    if (baseSize == EA_4BYTE)
    {
        GetEmitter()->emitIns_R_R_I(INS_sve_lsl, emitSize, op3Reg, op3Reg, 2, opt);
    }
    else
    {
        assert(baseSize == EA_8BYTE);
        GetEmitter()->emitIns_R_R_I(INS_sve_lsl, emitSize, op3Reg, op3Reg, 3, opt);
    }
}

⚠️ Test coverage — Tests are commented out

File: src/tests/Common/GenerateHWIntrinsicTests/Arm/Sve2Tests.cs

The diff shows test entries being added but they are commented out:

// (Templates.SveGatherVectorVectorBases,new Dictionary<string, string> {["TestName"] = "Sve2_GatherVectorByteZeroExtendNonTemporal_Bases_int_uint", ...

While I understand that the Vector-bases overloads were removed per #103297, there should be active tests for the overloads that are being shipped. Currently there appear to be no active test cases for the new APIs.

Recommendation: Enable test coverage for at least:

The (mask, pointer, offsets) overloads for each intrinsic variant
The (mask, Vector<ulong> addresses) overloads for 64-bit element types
Both 4-byte and 8-byte variants to catch the codegen issue above

⚠️ API consistency — Commented-out overloads citing #103297

Files: Sve2.cs, Sve2.PlatformNotSupported.cs

Multiple overloads are commented out with references to issue #103297:

// Removed as per #103297
// public static unsafe Vector<int> GatherVectorNonTemporal(Vector<int> mask, Vector<uint> addresses) => ...

This is fine if intentional, but note that the original API proposal in #94019 included these overloads. Consider:

Adding a tracking issue or linking to Arm64/Sve: Implement the SVE APIs that takes 32-bit address as input #103297 in the PR description to explain why these overloads are excluded
Ensuring the decision to exclude 32-bit Vector-base overloads is documented for future reference

💡 Suggestion — Consider factoring offset-scaling logic

The new codegen block (lines 2346-2378) duplicates the offset-scaling pattern already present for existing SVE gather intrinsics (lines 2252-2310). Consider consolidating this logic into a shared helper or extending the existing switch to cover both SVE and SVE2 gather variants, reducing future drift risk.

✅ Register ordering — Correct for SVE2 LDNT1x encoding

The comment "op2Reg and op3Reg are swapped" is correct. The SVE2 non-temporal gather instructions use the IF_SVE_IX_4A encoding which expects (Target, Pg, VectorOffset, ScalarBase) ordering. Since the intrinsic signature is (Mask, Base, Offsets), swapping the scalar base (op2) and vector offsets (op3) in the emission is necessary and correct.

Models Contributing to This Review

This review was synthesized from analysis by multiple models (Gemini 3 Pro, GPT-5.1-Codex) along with direct code inspection. Both models independently flagged the 4-byte GatherVectorNonTemporal codegen issue as a critical bug.

ylpoonlg · 2026-02-12T11:58:35Z

❌ Codegen correctness — GatherVectorNonTemporal may fail for 4-byte elements

These APIs with 4-byte addresses are commented out so they cannot be tested currently, but adding them for completeness.

⚠️ Test coverage — Tests are commented out
⚠️ API consistency — Commented-out overloads citing #103297

All commented tests/APIs should be related to #103297.

Implement SVE2 non-temporal gather loads

f22c7a8

github-actions bot added the area-System.Runtime.Intrinsics label Feb 2, 2026

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Feb 2, 2026

This was referenced Feb 2, 2026

iOS.Device test WorkItemExecutions #122874

Open

[Wasm] The request was canceled due to the configured HttpClient.Timeout of 100 seconds elapsing #123572

Open

XHarness package install failure on iOS due to devicectl NSPOSIXErrorDomain error 49 #123796

Open

Fix codegen for 4-byte GatherVectorNonTemporal

87dcb41

Merge branch 'main' into github-sve2_gatherload

cfcff3f

build-analysis bot mentioned this pull request Feb 12, 2026

Cannot find 'arm64-v8a' device dotnet/dnceng#2284

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement SVE2 non-temporal gather loads #123890

Implement SVE2 non-temporal gather loads #123890

Uh oh!

ylpoonlg commented Feb 2, 2026

Uh oh!

dotnet-policy-service bot commented Feb 2, 2026

Uh oh!

stephentoub commented Feb 11, 2026

Uh oh!

ylpoonlg commented Feb 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Implement SVE2 non-temporal gather loads #123890

Are you sure you want to change the base?

Implement SVE2 non-temporal gather loads #123890

Uh oh!

Conversation

ylpoonlg commented Feb 2, 2026

Uh oh!

dotnet-policy-service bot commented Feb 2, 2026

Uh oh!

stephentoub commented Feb 11, 2026

🤖 Copilot Code Review — PR #123890

Holistic Assessment

Detailed Findings

❌ Codegen correctness — GatherVectorNonTemporal may fail for 4-byte elements

⚠️ Test coverage — Tests are commented out

⚠️ API consistency — Commented-out overloads citing #103297

💡 Suggestion — Consider factoring offset-scaling logic

✅ Register ordering — Correct for SVE2 LDNT1x encoding

Models Contributing to This Review

Uh oh!

ylpoonlg commented Feb 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

❌ Codegen correctness — `GatherVectorNonTemporal` may fail for 4-byte elements